Using partial information to improve dialog in automatic speech recognition systems

ABSTRACT

A method, system and computer readable device for recognizing a partial utterance in an automatic speech recognition (ASR) system, where said method comprises the steps of: receiving, by an ASR recognition unit, an input signal representing a speech utterance or word and transcribing the input signal into text; interpreting, by an ASR interpreter unit, whether the text is either a positive or a negative match to a list of automated options by matching the text with a grammar or semantic database representing the list of automated options, wherein if the ASR interpreter unit results in said positive match, proceeding to a next input signal, and if the ASR interpreter unit results in said negative match, rejecting the text as representing said partial utterance; and processing, by a linguistic filtering unit, the rejected text to derive a correct match between the rejected text and the grammar or semantic database. The derived word is then used for responding to the user in the next dialog turn in order to reduce or eliminate churn in the human-computer spoken dialog interaction.

FIELD OF THE INVENTION

The present invention generally relates to automatic speech recognition (ASR), and more particularly, to a method, system and computer program storage device for using partial utterances or words to improve dialog in an ASR system.

BACKGROUND OF THE INVENTION

Increasingly, businesses, industries and commercial enterprises, among others, employ automated telephone call systems with interactive voice response (IVR) offering self-service menus. Instances of contacting an actual human responder are becoming rare. These automated telephone call systems utilize technologies such as automatic speech recognition (ASR), which allows a computer to identify the speech utterances or words that a caller speaks into their telephone's microphone and match them with the voice-driven menu. Such automated telephone call centers employing existing ASR technologies are prone to errors in identification and translation of a caller's speech utterances and words. With the increased use of cordless and cellular telephones, the instances of errors are compounded due to the inherent noise and/or static found in such wireless systems. Hence, a large percentage of callers' speech utterances or words are distorted such that only partial units of information get processed by the automated telephone call systems, resulting in re-prompting callers for menu selection choices that the user previously stated, or erroneous responses by the system, or no response at all.

A conventional method of automatic speech recognition (ASR) 100 is illustrated in FIG. 1, which requires that a caller first utter a speech utterance or word 110, which is then transcribed into text by ASR transcription 120 (speech-to-text conversion). The output of the ASR transcription (or text string) 120 is passed to the ASR interpreter/grammar module 130 for semantic interpretation or understanding. Typically, this form of ASR semantic interpretation involves a simple process of matching the recognized form (e.g., text string) of the caller's speech utterance or word with the pre-defined forms that exist in the grammar. Each matched item is assigned a confidence score by the system, so when there is a positive match 140 with a high confidence score, the output is used by the dialog manager (not shown) to execute the next relevant action (e.g., transition to a new dialog state or satisfy the user's request) 160.

By contrast, when the recognized text string does not match the pre-defined existing forms in the grammar, this results in an instance of a negative match or a “No Match” 150. Consequently, the conventional ASR system 100 will have to increase the error count and give the user additional tries by returning to the previous dialog state to ask for the same information all over again 170. The number of retries can be set by a voice user interface call flow variable, where the usual practice is to cap the number of retries at a maximum of three, after which the system gives up and the caller is transferred to an agent. This is the source of the problem in the current implementation, i.e., the blanket rejection of utterances that do not match (100%) the existing pre-defined forms in the grammar. For example, if a caller utters, “I want to speak to the director of Human Language Technology,” what may be recognized by the conventional ASR system 100 is only partial information such as “-anguage-logy”. Based on the conventional matching process, the text strings “language” and “technology”, which are pre-defined in the grammar, will not match the partial forms “-anguage” and “-logy”, resulting in such partial information being treated as a No Match because it is rejected by the ASR interpreter/grammar module 130. As a result, the caller is asked to try again by the conventional ASR system 100, and so on, until a successful match (translation) is achieved within the limited number of tries; otherwise, the caller is transferred to the agent.
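The conventional behavior described above can be sketched roughly as follows. This is only an illustrative sketch, not the implementation of any particular ASR product: the function name `conventional_dialog`, the constant `MAX_RETRIES`, and the representation of the grammar as a Python set are all assumptions made for the example.

```python
MAX_RETRIES = 3  # the usual cap on retries mentioned in the text


def conventional_dialog(recognized_attempts, grammar):
    """Exact-match interpretation with a retry cap (hypothetical sketch).

    recognized_attempts: text strings from the recognizer, one per try.
    grammar: set of pre-defined full forms.
    Returns ("match", word) or ("transfer_to_agent", None).
    """
    errors = 0
    for text in recognized_attempts:
        if text in grammar:      # only a complete (100%) match is accepted
            return ("match", text)
        errors += 1              # No Match: increase the error count
        if errors >= MAX_RETRIES:
            break                # give up after the capped number of tries
    return ("transfer_to_agent", None)


grammar = {"language", "technology"}
# Partial forms never match, so the caller ends up with an agent.
print(conventional_dialog(["-anguage", "-logy", "-anguage"], grammar))
```

Under these assumptions, every partially recognized string is simply counted as an error, which is the blanket rejection the text identifies as the problem.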

In some instances, the developer may formulate post-processing rules which will map, for example, partial strings like “-anguage” to full forms like “language”. The problem is that this is not an automatic process, it very often occurs late in the development process (during the tuning of the application after some interval from the initial deployment), and only some items (high frequency errors) are targeted for such post-processing rules. In other words, post-processing rules are selective (applied to isolated items), manual (not automatic), and costly to implement since they involve human labor. Accordingly, the problem in the conventional ASR systems described above is that current speech systems simply fail to make any fine-grained distinction within the No Match classification. In other words, in instances where a caller's utterance or word does not match completely with what is listed in the ASR interpreter/grammar module 130, it is rejected as a No Match, as lacking any intelligence that can be used to respond to a caller and thus move the dialog with automated telephone call systems along to the next sequence. Upon reaching the maximum number of retries (and if the error persists), the call ends up being transferred to an agent. For the success of self-service automation and to increase wider user adoption of speech systems, it is extremely important to solve this problem, particularly as the majority of users' calls are made from a cordless or cellular phone which, as explained above, has poor quality of reception, thereby increasing the likelihood that a user's utterances or words will be only partially recognized.

Having set forth the limitations of the prior art, it is clear that what is required is a method, system or computer program storage device capable of fine-grained distinction within the No Match classification of an ASR system to improve the success rate of self-service automation in automated telephone call systems with interactive voice response self-service menus.

SUMMARY OF THE INVENTION

It is therefore an object of the present invention to provide a method, system and computer program storage device for using partial utterances to improve dialog in an ASR system.

An additional object of the present invention is to provide a method, system and computer program storage device for recognizing (i.e., deriving meaningful linguistic information from) a partial utterance in an automatic speech recognition (ASR) system, where the method comprises the steps of: receiving, by an ASR recognition unit, an input signal representing a speech utterance or word and transcribing the input signal into electronic form or a form adapted for comparison; interpreting, by an ASR interpreter unit, whether the text is either a positive or a negative match to a list of automated options by matching the text with a grammar or semantic database representing the list of automated options, wherein if the ASR interpreter unit results in the positive match, proceeding to a next input signal, and if the ASR interpreter unit results in the negative match, rejecting and submitting the text for evaluation as representing the partial utterance; and processing, by a linguistic filtering unit, the rejected text to derive a correct match between the rejected text and the grammar or semantic database.

An additional object of the present invention is to further provide that the step of processing, by the linguistic filtering unit, further comprises the steps of: determining if the rejected text is a “parsable” speech utterance or word by means of a phonological, morphological, syntactic and/or semantic process(es), wherein each process(es) results in a suggested form of speech utterance or word for each of the process(es); assigning a score of +1 for each suggested form of speech utterance or word and ordering the suggested forms of speech utterance or word by a cumulative total score; and hypothesizing possible forms of the ordered rejected text by comparing each of the suggested forms of speech utterance with existing words in the grammar or semantic database by a context-relevant matching process.

Another additional object of the present invention is to provide the step, in the voice user interface call flow, of confirming, by a user, whether the hypothesized possible form of the ordered text is the “intended” speech utterance or word.

BRIEF DESCRIPTION OF THE DRAWINGS

The objects, features and advantages of the present invention will become apparent to one skilled in the art, in view of the following detailed description taken in combination with the attached drawings, in which:

FIG. 1 is an illustration of a conventional automatic speech recognition system according to the prior art; and

FIG. 2 is an illustration of a method, system and computer readable storage device for an automatic speech recognition system capable of using partial utterances to appropriately respond to a user in an automatic speech recognition system in accordance with one possible embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. For the purposes of clarity and simplicity, a detailed description of known functions and configurations incorporated herein will be omitted as it may make the subject matter of the present invention unclear.

FIG. 2 is an illustration of a method, system and computer readable storage device for an automatic speech recognition (ASR) system capable of using partial utterances that have been rejected as not matching pre-defined forms in the grammar 250 in accordance with one embodiment of the present invention. In operation, the present invention provides a method of handling the negative match 150 output from the ASR interpreter 130 as shown in FIG. 1. In other words, the ASR interpreter 130 concluded that a caller's utterance 110 or word is a No Match 150 and the present invention provides an additional process of determining what the caller has uttered rather than continuing to loop 170 through several iterations of asking the caller to repeat the utterance or word.

As can be seen in FIG. 2, a caller's utterance is rejected 251 as non-matching and also determined to contain partial items, i.e., word fragments or clipped phrases. The non-matching partial items are passed to a linguistic filter process 252 that includes morphological, syntactic, semantic, and phonological processes for recovering the full form of the caller's utterance or word. In this regard, the partial item is evaluated based on the application of each linguistic feature. As an illustration, consider a simple grammar with two pre-defined words: (a) unsuspecting, and (b) unrealistic. Now, a caller speaks their utterance but only the partial form “un---ting” is recognized. This partial information is passed on to the Linguistic Filter for evaluation based on the application of the following components:

Morphological: evaluates whether the shape of the partial form is consistent with predictable or acceptable morphological forms.

Phonological: evaluates whether the shape of the partial form is consistent with predictable or acceptable phonological forms, such as syllable structure information (e.g., it examines questions such as whether a given string or syllable can occur in word-initial position, or word-final position, etc.).

Semantic: evaluates whether the morpho-phonological form has any correlating meaning.

Syntactic: evaluates whether the morpho-phonological and semantic form has any correlating syntactic class or property, including lexical category information (whether noun, verb, adjective, etc.).

For each linguistic process that applies from all four categories, the partial form is assigned a +1 score.

As an illustration, the linguistic filter, based on each of the components described above, will apply to the string “un” in the following manner:

Phonological=“un” (score +1)

Morphological=“un” (score +1)

Semantic=“un” (score +1)

Syntactic=“un” (score +1)

This means that the partial string “un” contains phonological (+1), morphological (+1), semantic (+1) and syntactic (+1) information that can be applied for determining the actual word by comparing with the existing words in the grammar. The cumulative weight from the linguistic filter for the partial string is a score of 4, based on a positive score from each of the four linguistic components.

As an additional illustration, the linguistic filter, based on each of the components, will apply to the string “ting” in the following manner:

Phonological=“ting” (score +1)

Morphological=“ting” (score +1)

Semantic=“ting” (score 0)

Syntactic=“ting” (score 0)

This means that the partial string “ting” contains only phonological (+1) and morphological (+1) information that can be applied for determining the actual word by comparing with the existing words in the grammar. In this instance, the partial string lacks semantic (0) and syntactic (0) features. Consequently, the cumulative weight from the linguistic filter for the partial string “ting” is a score of 2, based on a positive score from only two of the four linguistic components.
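The +1 scoring scheme can be illustrated with a minimal sketch. The text does not specify how each linguistic component reaches its decision, so the per-component judgments below are simply hand-copied from the worked example; the names `COMPONENTS`, `JUDGMENTS` and `filter_score` are hypothetical and chosen only for illustration.

```python
COMPONENTS = ("phonological", "morphological", "semantic", "syntactic")

# Hypothetical hand-labelled judgments taken from the worked example.
# A real linguistic filter would derive these from linguistic resources.
JUDGMENTS = {
    "un": {"phonological": True, "morphological": True,
           "semantic": True, "syntactic": True},
    "ting": {"phonological": True, "morphological": True,
             "semantic": False, "syntactic": False},
}


def filter_score(fragment):
    """Cumulative weight: +1 for each component that applies, else 0."""
    judged = JUDGMENTS.get(fragment, {})
    return sum(1 for component in COMPONENTS if judged.get(component, False))


print(filter_score("un"), filter_score("ting"))  # 4 2
```

A fragment for which no component applies receives a cumulative weight of 0 in this sketch.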

The linguistic filtering process 252 is followed by an ordering and ranking process 253, which sums the scores of the partial forms (cumulative scores) resulting from the number of processes matched by the morphological, syntactic, semantic, and phonological properties in the linguistic filter and posits these as possible forms for the partially recognized form. Continuing with the example from the previous paragraph, when the ordering and ranking process is applied, the following results are derived:

Partial form that was recognized=“un-----ting”

Predefined items in the grammar: “unsuspecting” “unrealistic”

Applying the ordering and ranking process will yield:

[Un]=(4 linguistic properties)

[ting]=(2 linguistic properties)

These partial forms are ordered in terms of the cumulative scores or values from the linguistic filtering process 252 to determine if a partial form indeed has enough linguistic evidence for deriving its linguistic status 253, and are then used for making a direct comparison with the existing pre-defined words in the grammar 254. As we see from this illustration, both strings in the partially recognized utterance contain sufficient linguistic information (“un” has 4 and “ting” has 2) that can be used for the evaluation of existing words in the grammar to find the right match in order to make progress in the dialog. Crucially, a string only requires a minimum of 1 positive score to be used for this sort of evaluation.
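A rough sketch of the ordering and ranking step, assuming the cumulative scores are already available as a dictionary, might look like the following. The names `MIN_SCORE` and `rank_fragments` are hypothetical; only the threshold of 1 and the descending order by cumulative score come from the description.

```python
MIN_SCORE = 1  # a string needs at least 1 positive score to be evaluated


def rank_fragments(scores):
    """Keep fragments with enough linguistic evidence, highest score first."""
    usable = [(frag, score) for frag, score in scores.items()
              if score >= MIN_SCORE]
    return sorted(usable, key=lambda item: item[1], reverse=True)


# "zz" stands in for a fragment to which no linguistic component applied.
print(rank_fragments({"ting": 2, "un": 4, "zz": 0}))
# [('un', 4), ('ting', 2)]
```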

Next, the ranked ‘reconstructed’ form is compared with existing words in the grammar database to find the context-relevant matches 254. Context-relevance is calculated on the basis of the existing forms in the pre-defined grammar. This means that the partial forms are compared only to the existing forms in the grammar and nothing else. Thus, based on the combination of the score from the linguistic filtering process 252 along with the context-relevant matching, the most confident form is posited for confirmation to the caller 255. As an illustration, when the partial form is compared with the two pre-defined words in the grammar the following results emerge:

“un” and “ting” are partial strings that can both be identified in the word “unsuspecting” through the matching process. Furthermore, based on the linguistic filter results, Un-suspec-ting matches the partial form in a total of 6 linguistic features (4 for “un” plus 2 for “ting”, as scored by the linguistic filter above).

By comparison, only one of the partial strings, “un”, can be identified in the other word in the grammar, “unrealistic”. More importantly, Un-realis-tic matches the partial form in only 4 linguistic features.

Consequently, the caller is offered the highest ranked result in the output (unsuspecting) and the caller is asked to confirm or reject the “reconstructed” word in the ensuing dialog.
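The comparison against the grammar can be sketched as below. This is an illustrative approximation only: the description does not define the context-relevant matching algorithm, so a simple in-order substring search is assumed here, and `FRAGMENT_SCORES`, `split_partial`, `match_score` and `best_candidate` are hypothetical names. The fragment scores are the ones from the worked example.

```python
import re

# Cumulative filter scores from the worked example (assumed as given input).
FRAGMENT_SCORES = {"un": 4, "ting": 2}


def split_partial(partial):
    """Split a partial form such as 'un-----ting' on runs of hyphens."""
    return [piece for piece in re.split(r"-+", partial) if piece]


def match_score(partial, word):
    """Sum the scores of the fragments that occur, in order, in `word`."""
    total, pos = 0, 0
    for frag in split_partial(partial):
        idx = word.find(frag, pos)
        if idx == -1:
            continue                      # fragment not present in this word
        total += FRAGMENT_SCORES.get(frag, 0)
        pos = idx + len(frag)             # later fragments must come after
    return total


def best_candidate(partial, grammar):
    """Compare only against the pre-defined grammar; return the top word."""
    return max(grammar, key=lambda word: match_score(partial, word))


grammar = ["unsuspecting", "unrealistic"]
for word in grammar:
    print(word, match_score("un-----ting", word))
print(best_candidate("un-----ting", grammar))
```

With these assumptions, “unsuspecting” scores 6 and “unrealistic” scores 4, so “unsuspecting” is the word posited to the caller for confirmation.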

Thus, consider, for example, a caller who says “I want to see if there is a problem with the --otes”, where the first syllable of “notes” is clipped off, or the example provided above where only “-anguage-logy” is recognized. Instead of classifying these into the No Match bucket, the “partial string” is sent to the ASR interpreter and used in comparing the list of related forms in the grammar. The grammar (interpreter) already includes the full form of the relevant phrases that a caller might say. Accordingly, by comparing with existing forms in the grammar, the system will produce a list of related forms and then rank these with relative confidence of ‘closeness’ computed from context-relevance (e.g., how much they match existing forms using a linguistic filter). Then, the user is given a chance to confirm or refine the partially recognized form. Based on this process, instead of rejecting a partial utterance, the system will come back with its best guess about the caller's intended word, using a matching algorithm in the linguistic filter to reconstruct the utterance's meaning, form, or structure, and then offer the caller a more intuitive way to confirm or refine what was recognized without necessarily losing a dialog turn. In this regard, the voice user interface (VUI) call flow may provide a re-prompt, such as, “I am sorry I didn't catch all of that, did you say you want help with ‘notes’?” The “reconstructed” word from the partially recognized utterance is offered in the dialog response by the computer system instead of the conventional re-prompt that says “I'm sorry, I did not catch that. Please say that again,” which typically results in multiple re-tries and subsequently in the caller being transferred to a human agent.
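The choice between the two re-prompts can be sketched in a few lines. The function name `next_prompt` and the use of `None` to signal that no candidate could be reconstructed are assumptions made for this illustration; the prompt wording is taken from the examples in the text.

```python
def next_prompt(candidate):
    """Pick the re-prompt for the next dialog turn (hypothetical sketch).

    candidate: the reconstructed word, or None if nothing was recovered.
    """
    if candidate is None:
        # Conventional re-prompt: the caller must repeat everything.
        return "I'm sorry, I did not catch that. Please say that again."
    # Confirmation re-prompt built around the reconstructed word.
    return ("I am sorry I didn't catch all of that, "
            f"did you say you want help with '{candidate}'?")


print(next_prompt("notes"))
print(next_prompt(None))
```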

Moreover, the present invention contrasts with existing approaches, which use (a) a confidence score and/or (b) an n-best list to determine the confidence or legitimacy of items in speech recognition grammars. By definition and process, these approaches consistently fail to apply to partially recognized forms because they operate on fully well-formed words or utterances. Instead, the present invention provides a new approach to determining the confidence or legitimacy of partially recognized words or utterances, whereby each linguistic feature in the Linguistic Filter is automatically applied in trying to recover or match the partial form, and the output from the filter is then used for comparing the “reconstructed” words from the partial items with the existing full forms already in the grammar. As previously explained, each linguistic feature that applies to a partial string gets assigned a score of +1. The cumulative weight derived from adding up all the positive counts from the linguistic features is then used for determining the legitimacy of the word. The matching word with the highest number of positive features is postulated as the actual word that the user had originally spoken (which was partially recognized) and this “reconstructed” word is offered in the subsequent dialog with the user.

As will be readily apparent to those skilled in the art, the present invention or aspects of the invention can be realized in hardware, or as some combination of hardware and software. Any kind of computer/server system(s), or other apparatus adapted for carrying out the methods described herein, is suited. A typical combination of hardware and software could be a general-purpose computer system with a computer program that, when loaded and executed, carries out methods described herein. Alternatively, a specific use computer, containing specialized hardware for carrying out one or more of the functional tasks of the invention, could be utilized.

The present invention or aspects of the invention can also be embodied in a computer program product, which comprises all the respective features enabling the implementation of the methods described herein, and which, when loaded in a computer system, is able to carry out these methods. Computer program, software program, program, or software, in the present context mean any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: (a) conversion to another language, code or notation; and/or (b) reproduction in a different material form.

The present invention can also be embodied as a program on a computer-readable recording medium. Examples of the computer-readable recording medium include but are not limited to Compact Disc Read-Only Memory (CD-ROM), Random-Access Memory (RAM), floppy disks, hard disks, and magneto-optical disks.

While there has been shown and described what is considered to be preferred embodiments of the invention, it will, of course, be understood that various modifications and changes in form or detail could readily be made without departing from the spirit of the invention. It is therefore intended that the scope of the invention not be limited to the exact forms described and illustrated, but should be construed to cover all modifications that may fall within the scope of the appended claims.

1. A method of correctly determining content or meaning from a partial spoken utterance in an automatic speech recognition (ASR) system, said method comprising the steps of: receiving, by an ASR recognition unit, an input signal representing a speech utterance or word and transcribing said input signal into a representative electronic textual form; interpreting, by an ASR interpreter unit, whether said representative electronic textual form is either a positive or a negative match to a list of automated options by matching said representative electronic textual form with a grammar or semantic database representing said list of automated options, wherein if said ASR interpreter unit results in said positive match proceeding to a next input signal and if said ASR interpreter unit results in said negative match rejecting and submitting said representative electronic textual form as representing said partial utterance; processing, by a linguistic filtering unit, said rejected representative electronic textual form to derive a correct match between said rejected representative electronic textual form and said grammar or semantic database; and determining by said linguistic filtering unit, if said rejected representative electronic textual form is said speech utterance or word by a phonological, morphological, syntactic and/or semantic process(es), wherein each process(es) results in a suggested form of speech utterance or word for each of the process(es); assigning a score for each suggested form of speech utterance or word and ordering said suggested form of speech utterance or word according to a cumulative total score; and comparing each said suggested form of speech utterance with existing words in said grammar or semantic database by a context-relevant matching process; and hypothesizing possible forms of said ordered rejected text based on said comparing.